Strictly Batch Imitation Learning by Energy-based Distribution Matching

Neural Information Processing Systems

Consider learning a policy purely on the basis of demonstrated behavior, that is, with no access to reinforcement signals, no knowledge of transition dynamics, and no further interaction with the environment. This strictly batch imitation learning problem arises wherever live experimentation is costly, such as in healthcare. One solution is simply to retrofit existing algorithms for apprenticeship learning to work in the offline setting. But such an approach leans heavily on off-policy evaluation or offline model estimation, and can be indirect and inefficient. We argue that a good solution should be able to explicitly parameterize a policy (i.e.


Review for NeurIPS paper: Strictly Batch Imitation Learning by Energy-based Distribution Matching

Additional Feedback:
- The authors note (with references) that the pure behavioral cloning approach performs poorly because it doesn't use information about the dynamics and state distributions of the problem. It would be useful if the authors could present a short, concrete example of exactly what type of information is lost when the MDP structure is ignored. On a first read, the text seems to imply that the offline setting means we have all the information we *need* from the start, which I think is the opposite of what the authors are trying to say.
- Line 112: This sentence immediately brings to mind a choice between parametric and non-parametric methods. I don't think that's what the authors mean, so maybe the terminology of "parameterizing a policy" should be changed throughout the paper. If it is what they mean, then it is not made clear why a parametric approach is the correct choice.


Review for NeurIPS paper: Strictly Batch Imitation Learning by Energy-based Distribution Matching

The reviewers unanimously agree that the paper makes a nice contribution to imitation learning in the batch setting. That said, the paper has two major weaknesses: 1. During the discussion, the reviewers expressed confidence that the authors understand the mistake and know how to address it (see, e.g., the post-rebuttal update of R4). Therefore, we recommend acceptance on the condition that the authors take this issue seriously, correct the technical mistake, and remove any incorrect or misleading claims associated with it. The authors are strongly encouraged to add such a comparison in the camera-ready version. On a related note, while the algorithm only uses (s,a) pairs as data, trajectory data is often available, from which one can extract (s,a,r,s') pairs.

